
IFU v2.14.dev0#557

Draft
ipanfilo wants to merge 115 commits into dev from IFU-dev-20260315-v2.14

Conversation

@ipanfilo
Collaborator

Description

IFU from 2026-03-15 upstream commit 708d7c1
https://github.com/ROCm/frameworks-internal/issues/16312

Type of change

  • Documentation change (a documentation-only change, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

ptrendx and others added 30 commits January 20, 2026 09:14
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
* update FE to 1.17

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* add determinism flag

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* add determinism to test

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* add determinism to qa/

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* move bias/dbias/versioning/dropout logic to C API

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update qa/L0_pytorch_unittest/test.sh

make .xml file specific to deterministic tests in qa/

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* add determinism to Jax extension

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* add determinism to Jax tests

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update tests/jax/test_fused_attn.py

fix typo

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* Update transformer_engine/common/fused_attn/fused_attn.cpp

fix indentation

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix the AI fixes

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix Jax extension call

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* minor fixes based on comments

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix selection logic and fwd arg

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix version check in Jax test

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix pytorch CI failures

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix Jax CI failures

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix non-/determinism logic and CI

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix formatting

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update transformer_engine/common/fused_attn/fused_attn.cpp

fix and/or logic

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* update to 9.18.1 for requirement

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* reduce Jax CI tests for determinism

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
* Implemented persistent nvfp4 kernel

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix FP4 guard in ptx

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Fix

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Fix in ptx. reduxf32 guard

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Fixes per PR review

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fixes per PR review. Added parameter to turn off the persistency

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Modified reference CPU implementation in C++ unit tests to match GPU (numerical truncation). Tightened the numerical tolerance

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Disabled persistency by default, as non-persistent kernel is more performant when inputs are large

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Use the tuned kernel also for the rowwise only quantization

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Fixed typo

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Addressed comments from the PR review

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Resolved conflicts

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Macros renaming

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

---------

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
* PoC of the changes

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Early exit from the Free function for the empty tensor

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Use the proper function for nvtx range

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Only do mark_not_offload when the cpu_offloading is enabled

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* First pass on making the setattr issue not come back

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Actually add pytest.ini

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Changes to __init__

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* A different way

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* WAR the fact that it is not possible to set __setattr__ dynamically

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Simpler solution and fixes

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Fix for the inference mode DPA

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Start of debugging debug tools

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* More fixes in debug

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Speculative moving the validate_name to the constructor

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Making the debug tools names saner

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Change the setattr usage in the tensor parallel group setting

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Adding try/finally - it does not seem to impact the time in observable
way

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Fixing lint issues and the thunder test

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Fix 1 of the debug tests

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Removed the warning and enforcement in the CI

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* try-finally in the context manager

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Fixing the debug tests

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

---------

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Fix cb.CUDAOptions usage for Triton 3.6.0

Signed-off-by: jberchtold-nvidia <158520091+jberchtold-nvidia@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update utils.py

Signed-off-by: jberchtold-nvidia <158520091+jberchtold-nvidia@users.noreply.github.com>

* Update utils.py

Signed-off-by: jberchtold-nvidia <158520091+jberchtold-nvidia@users.noreply.github.com>

* Update utils.py

Signed-off-by: jberchtold-nvidia <158520091+jberchtold-nvidia@users.noreply.github.com>

---------

Signed-off-by: jberchtold-nvidia <158520091+jberchtold-nvidia@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
return tokens_per_experts always

Signed-off-by: tdophung <tdophung@nvidia.com>
* Update THD sink attention logic for newer cudnn versions

THD Sink attention is supported in 9.18.0

Signed-off-by: Chen Cui <chcui@nvidia.com>
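
The commit above gates THD sink attention on the cuDNN version (supported from 9.18.0). A minimal sketch of that kind of version gate; `thd_sink_supported` is a hypothetical helper name, not the actual function in the codebase:

```python
def thd_sink_supported(cudnn_version: tuple) -> bool:
    """Hypothetical version gate: per the commit message above,
    THD + sink attention requires cuDNN >= 9.18.0."""
    return cudnn_version >= (9, 18, 0)


# Tuple comparison handles major/minor/patch ordering naturally.
print(thd_sink_supported((9, 18, 0)))  # supported
print(thd_sink_supported((9, 17, 5)))  # not supported
```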

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update thd sink attention logic for cp>1

Signed-off-by: Chen Cui <chcui@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add unit test for thd + sink attention

Signed-off-by: Chen Cui <chcui@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* address comments

Signed-off-by: Chen Cui <chcui@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* do not skip thd cp sink attention test

Signed-off-by: Chen Cui <chcui@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* disable deterministic mode for sink attention

Signed-off-by: Chen Cui <chcui@nvidia.com>

---------

Signed-off-by: Chen Cui <chcui@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* SWA (left, right) with FusedAttention changes cherry-picked from NVIDIA/TransformerEngine#1369

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix test_kv_cache failures

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* remove unnecessary comments

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* fix some more filter issues, address feedback

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* fix for local test case failures - `bottom_right_diagonal` should be calculated in `fused_attn_fwd` call as well

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* make conditions more accurate

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* add cp tests to test swa (left, right)

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove dead code and make conditions better

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix lint

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* feedback from Charlene

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* small er

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* plumb `bottom_right_diagonal` through jax

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* plumb `bottom_right_diagonal` through jax

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add missing fields

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* use proper mask type in CP

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Use correct block size for workspace in row id map creation, also shard workspace correctly based on 2nd dim of routing_map/row_id map

Signed-off-by: DoubleCheeseCheetos <hanhdp99@gmail.com>

* reduce size of largest test case on single_GPU scenario to fit on L40 and A100 in CI line up

Signed-off-by: tdophung <hanhdp99@gmail.com>

---------

Signed-off-by: DoubleCheeseCheetos <hanhdp99@gmail.com>
Signed-off-by: tdophung <hanhdp99@gmail.com>
Co-authored-by: DoubleCheeseCheetos <hanhdp99@gmail.com>
* Disabled the tuned NVFP4 kernels

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Disabled fast math in cpp tests

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

---------

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
* Expose option for custom op fusions

Refactor fusion functions to remove index bookkeeping. Refactor fused ops to use consistent operation order.

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Add tests for custom ops

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix linter warnings and numerical test failures

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Tweak pattern matching logic with fixed window sizes

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Use TF32 tols in fused op tests

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Review suggestion from @greptile-apps

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Backpropagate fixes from #2622

Signed-off-by: Tim Moon <tmoon@nvidia.com>

---------

Signed-off-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* fix(examples): te_llama compatibility with HuggingFace transformers >= 4.57

The te_llama.py example was failing with HuggingFace transformers 4.57+
due to API changes in how decoder layer outputs are handled.

Changes:
- Handle case where hidden_states is passed as a tuple (older HF versions)
- Return tensor directly instead of wrapped in tuple (HF 4.57+ expects this)
- Fix regex pattern to use raw string (fixes SyntaxWarning)

Error fixed:
  AttributeError: 'tuple' object has no attribute 'contiguous'

Tested with:
- transformer_engine 2.5.0
- transformers 4.57.3
- PyTorch container nvcr.io/nvidia/pytorch:25.08-py3

Signed-off-by: Santosh Bhavani <santosh.bhavani@live.com>
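
The compatibility fix above handles both HuggingFace conventions: older versions pass `hidden_states` wrapped in a tuple, while transformers >= 4.57 expects the bare tensor. A minimal sketch of that pattern; `unwrap_hidden_states` is a hypothetical helper, not the exact code in te_llama.py:

```python
def unwrap_hidden_states(outputs):
    """Accept either convention: older HF versions hand the decoder layer
    a tuple whose first element is the hidden-states tensor, newer ones
    hand the tensor directly. Returning the unwrapped value avoids the
    "'tuple' object has no attribute 'contiguous'" failure."""
    if isinstance(outputs, tuple):
        return outputs[0]
    return outputs
```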

* docs(te_llama): add requirements.txt

Signed-off-by: Santosh Bhavani <santosh.bhavani@live.com>

* fix(docs): add missing notebook output names

Signed-off-by: Santosh Bhavani <santosh.bhavani@live.com>

---------

Signed-off-by: Santosh Bhavani <santosh.bhavani@live.com>
Use "nyu-mll/glue" instead of "glue" for encoder datasets to fix 404 error (#2625)

* Use "nyu-mll/glue" instead of "glue" for encoder datasets to fix 404 error

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* rename mnist dataset path

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* add dataset manifest

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

---------

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
* jit bug fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fixes

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* lint fixes

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

---------

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* code drop

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add FP8 scale support and fix alignment for grouped GEMM

- Add FP8 scale_inv pointer handling in nvte_grouped_gemm for proper FP8 GEMM
- Fix random padding in tests to ensure 16-byte alignment for all dtypes
- Reorder GroupedGemmSetupWorkspace members for natural alignment
- Remove debug prints

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
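
The commit above pads test buffers so every dtype lands on a 16-byte boundary. A minimal sketch of the rounding arithmetic, assuming a power-of-two alignment (the helper name is illustrative, not from the codebase):

```python
def pad_to_alignment(nbytes: int, alignment: int = 16) -> int:
    """Round a byte count up to the next multiple of `alignment`.
    The bit trick requires a power-of-two alignment."""
    assert alignment > 0 and (alignment & (alignment - 1)) == 0
    return (nbytes + alignment - 1) & ~(alignment - 1)


# A 17-byte payload occupies 32 bytes when 16-byte aligned.
print(pad_to_alignment(17))  # 32
```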

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Grouped GEMM: code cleanup and NULL C support

- Remove unused alignment parameter from GroupedGemmSetupWorkspace::from_buffers
- Simplify select_grouped_operand by removing dead code branches
- Add GroupedOperandSelection.tensor field to avoid passing tensor separately
- Extract set_fp8_scale_pointers and init_matrix_layouts helpers
- Add safety check for FP8 on Hopper column-wise fallback
- Support NULL C tensor when beta=0 (uses D as placeholder)
- Remove unused get_scale_inv() from test
- Add use_null_c test parameter and test case
- Fix documentation: alpha/beta are single element tensors only

Signed-off-by: Piotr Gadzinski <pgadzinski@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Grouped GEMM: per-matrix alpha/beta support

- Change alpha/beta from single values to per-matrix arrays
- Validate alpha/beta have exactly num_tensors elements
- Update kernel to index alpha_ptr[idx] and beta_ptr[idx]
- Move alpha/beta validation to validate_grouped_gemm_inputs
- Update tests to use per-matrix alpha/beta arrays
- Update documentation

Signed-off-by: Piotr Gadzinski <pgadzinski@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
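
The commit above changes alpha/beta from single values to per-matrix arrays indexed by `idx`, and an earlier commit allows a NULL C tensor when beta is 0. A naive pure-Python reference of that semantics (`grouped_gemm_ref` is an illustrative sketch, not the CUDA kernel):

```python
def grouped_gemm_ref(A, B, C, alpha, beta):
    """Reference semantics: D[i] = alpha[i] * A[i] @ B[i] + beta[i] * C[i].
    alpha/beta are per-matrix arrays (one entry per GEMM in the group);
    C[i] may be None when beta[i] == 0, mirroring the NULL-C case."""
    assert len(alpha) == len(A) and len(beta) == len(A), \
        "alpha/beta must have exactly num_tensors elements"
    D = []
    for i in range(len(A)):
        m, k, n = len(A[i]), len(B[i]), len(B[i][0])
        out = [[0.0] * n for _ in range(m)]
        for r in range(m):
            for c in range(n):
                acc = sum(A[i][r][t] * B[i][t][c] for t in range(k))
                cval = 0.0 if beta[i] == 0 or C[i] is None else C[i][r][c]
                out[r][c] = alpha[i] * acc + beta[i] * cval
        D.append(out)
    return D
```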

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix alpha/beta numel - use SimpleTensor::numel()

Signed-off-by: Piotr Gadzinski <pgadzinski@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* Refactor: move grouped GEMM to separate file and cleanup API

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* Require Blackwell (SM100) and cuBLAS 13.1+ for grouped GEMM

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fixes

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fixes

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update transformer_engine/common/gemm/config.h

Co-authored-by: Przemyslaw Tredak <ptrendx@gmail.com>
Signed-off-by: Paweł Gadziński <62263673+pggPL@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* changed

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* suggestions

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* refactored hopper tensor selection

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Signed-off-by: Piotr Gadzinski <pgadzinski@nvidia.com>
Signed-off-by: Paweł Gadziński <62263673+pggPL@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Przemyslaw Tredak <ptrendx@gmail.com>
fix wheel

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
* version change

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

---------

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
* Code drop: Update recipes documentation and remove custom recipes from low precision training

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* Fix SVG css import path for diagrams

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* Refactor low_precision_training docs: remove optimizers, fix imports, add GPU checks

Changes:
- Remove optimizer code from all recipe examples (keep only forward/backward)
- Fix Format imports (use Format.E4M3 instead of string 'E4M3')
- Fix params_dtype for PyTorch examples (add params_dtype=torch.bfloat16)
- Add GPU capability assertions before START blocks for blockwise/mxfp8/nvfp4
- Fix JAX imports (Float8CurrentScaling from common.recipe, NVFP4BlockScaling)
- Add global_shard_guard for TransformerLayer examples in JAX
- Fix fused_layers_jax.py return tuple unpacking
- Update memory_usage JAX examples with dynamic GPU measurement
- Remove memory_usage_3_jax (JAX doesn't support FP8 weight storage)
- Update performance_considerations.rst for JAX differences
- Delete unused .out files and fp8_autocast_jax.py

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix JAX memory usage .out files with correct output

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* responded to comments

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* applied suggestions from greptile

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* year change

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* jax compute capability fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fixes

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

---------

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Support building with headers from nvidia wheels

There are two changes:
1. `import nvidia` returns a namespace package with `__file__` equal to `None`
2. Add an environment variable to force headers from the nvidia wheels. Without it, using them is practically impossible when CUDA is installed system-wide.

I successfully built the package with torch using the following `uv` configuration:
```
[tool.uv.extra-build-dependencies]
"transformer-engine-torch" = [
    "ninja",
    "nvidia-cuda-crt==13.0.88",
    "nvidia-cuda-cccl==13.0.85",
    { requirement = "torch", match-runtime = true },
    { requirement = "pytorch-triton", match-runtime = true },
    { requirement = "nvidia-cusolver", match-runtime = true },
    { requirement = "nvidia-curand", match-runtime = true },
    { requirement = "nvidia-cublas", match-runtime = true },
    { requirement = "nvidia-cusparse", match-runtime = true },
    { requirement = "nvidia-cudnn-cu13", match-runtime = true },
    { requirement = "nvidia-nvtx", match-runtime = true },
    { requirement = "nvidia-cuda-nvrtc", match-runtime = true },
    { requirement = "nvidia-cuda-runtime", match-runtime = true },
]
```

Signed-off-by: Vadim Markovtsev <vadim@poolside.ai>
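
The build change above hinges on Python namespace-package semantics: a package directory without `__init__.py` has no usable `__file__`, so header locations must come from `__path__` instead. A self-contained sketch using a synthetic `fake_ns_pkg` (not the real nvidia wheels):

```python
import importlib
import os
import sys
import tempfile

# A directory on sys.path without __init__.py is imported as a namespace
# package; like `import nvidia` from the wheels, it exposes no usable
# __file__ to locate headers from.
tmp = tempfile.mkdtemp()
os.makedirs(os.path.join(tmp, "fake_ns_pkg"))
sys.path.insert(0, tmp)

pkg = importlib.import_module("fake_ns_pkg")
assert getattr(pkg, "__file__", None) is None

# Header search paths must instead be derived from __path__ entries.
search_paths = list(pkg.__path__)
print(search_paths)
```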

* Apply suggestion from @ksivaman

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

---------

Signed-off-by: Vadim Markovtsev <vadim@poolside.ai>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Fixed scaling-factor computation for FP32 to match the reference implementation.

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Uncommented the tuned kernel path

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* init

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fixes

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* year update in license

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Update README.rst

Signed-off-by: jberchtold-nvidia <158520091+jberchtold-nvidia@users.noreply.github.com>

* Update README.rst

Signed-off-by: jberchtold-nvidia <158520091+jberchtold-nvidia@users.noreply.github.com>

* Update README.rst

Signed-off-by: jberchtold-nvidia <158520091+jberchtold-nvidia@users.noreply.github.com>

---------

Signed-off-by: jberchtold-nvidia <158520091+jberchtold-nvidia@users.noreply.github.com>
Signed-off-by: Kaining Zhong <kainingz@nvidia.com>
* Rebased to main

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fixed the year to 2026

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Added compilation guards

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Added BWD pass

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Added dbias and dact tests. Refactoring.

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Added grouped MXFP8 DACT and ACT API and tests

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fixed a typo

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Fixes per the review

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* More fixes from the review

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Fixes per the review

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Relaxed requirement for last dim from mod128 to mod32

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
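The relaxation above (last dimension divisible by 32 instead of 128) can be sketched as a simple check; this is an illustration only, not the actual kernel guard:

```python
# Sketch of the relaxed last-dimension requirement: divisibility by 32
# instead of the previous 128 (illustrative helper, not TE code).

def last_dim_ok(shape, multiple=32):
    """Return True if the innermost dimension meets the alignment requirement."""
    return shape[-1] % multiple == 0

assert last_dim_ok((16, 64))                    # 64 % 32 == 0: accepted after the relaxation
assert not last_dim_ok((16, 64), multiple=128)  # rejected under the old mod128 rule
```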

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Added alignment checks when tensor descriptors are modified

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: vthumbe1503 <vthumbe@nvidia.com>
bucket max_b with more granularity when >512

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* expand troubleshooting docs

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Update README.rst

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: jberchtold-nvidia <158520091+jberchtold-nvidia@users.noreply.github.com>

* Update README.rst

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: jberchtold-nvidia <158520091+jberchtold-nvidia@users.noreply.github.com>

* Update README.rst

Signed-off-by: jberchtold-nvidia <158520091+jberchtold-nvidia@users.noreply.github.com>

---------

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
Signed-off-by: jberchtold-nvidia <158520091+jberchtold-nvidia@users.noreply.github.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
* add grad reduce api for cuda graph hook

Signed-off-by: Pingtian Li <pingtianl@nvidia.com>

* fix code consistency

Signed-off-by: Pingtian Li <pingtianl@nvidia.com>

---------

Signed-off-by: Pingtian Li <pingtianl@nvidia.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
* Fix the compilation warnings for the PyTorch extension

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Apply suggestion from @greptile-apps[bot]

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Przemyslaw Tredak <ptrendx@gmail.com>

---------

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
Signed-off-by: Przemyslaw Tredak <ptrendx@gmail.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Bias92 and others added 26 commits March 10, 2026 10:30
…nput (#2746)

* Fix cross_entropy_forward stride guard for non-contiguous input

Signed-off-by: Bias92 <pewpewplay315@gmail.com>
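The stride guard for non-contiguous (e.g. transposed) input can be sketched in pure Python; the helper names below are hypothetical, and the real check lives in the cross-entropy extension:

```python
# Minimal sketch of a contiguity/stride guard (hypothetical helpers,
# not the actual cross_entropy_forward implementation).

def row_major_strides(shape):
    """Expected element strides for a contiguous row-major tensor."""
    strides = [1] * len(shape)
    for i in range(len(shape) - 2, -1, -1):
        strides[i] = strides[i + 1] * shape[i + 1]
    return tuple(strides)

def is_contiguous(shape, strides):
    """A tensor is contiguous iff its strides match the row-major layout."""
    return tuple(strides) == row_major_strides(shape)

# A transposed 2D view swaps the strides without copying data:
shape = (4, 8)
assert is_contiguous(shape, (8, 1))      # contiguous layout passes the guard
assert not is_contiguous(shape, (1, 8))  # transposed view must be made contiguous first
```

A kernel that assumes row-major indexing would read the wrong elements from the `(1, 8)` layout, which is why the guard falls back to a contiguous copy.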

* Add regression test for non-contiguous transposed input

Signed-off-by: Bias92 <pewpewplay315@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Bias92 <pewpewplay315@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Implemented the kernel with split dbias

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Relaxed constraints on the last dimension

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Added notes on group tensor restrictions into documentation

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Fixes per the review

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Fixed pointer

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* More fixes

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Fixed kernel grid size

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

---------

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* ccache size limit

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* ccache

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* ccache

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* ccache

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* ccache

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* ccache

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* ccache

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* ccache

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* ccache

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* ccache

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* ccache

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* ccache

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* ccache

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* ccache

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* ccache

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* ccache

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* added blackwell

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* added blackwell

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* Revert build.yml changes unrelated to deploy nightly docs

Restore .github/workflows/build.yml to upstream/main state.
Only deploy_nightly_docs.yml changes are relevant to this PR.

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

---------

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
… vmapped seg offsets (#2692)

* Fix batcher for the case where the received segment ids are batched/vmapped while the TE-constructed segment pos are not, causing mismatches in impl()

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* nit: Fix the shape check for assert

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix the batcher logic to check for q and kv seg ids separately

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Remove batcher logic to expand segment pos. Keep the shape check asserts.

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Add support for vmapped seg id and non vmapped seg pos when computing the seqlens and offsets for fused attn

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Undo batcher check logic for seg pos and seg ids as it is already moved to get_seqlens_and_offsets()

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* nit: Remove unnecessary assert check

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* nit: Code clean up

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* nit: Use partial instead of a single use function

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

---------

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
…t aligned (#2747)

* fallback

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

* warn once

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

---------

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>
* code drop

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [Debug] Pass tp_size to DebugQuantizer and use it in get_reduction_params

Use tp_size to determine whether tensor parallelism is active instead of
checking whether tp_group is None (which is ambiguous, since None means the
world group in torch.distributed). Also add tp_size to the backward-compat
kwargs filtering in call_feature, so custom features without tp_size in their
inspect_tensor signature continue to work.

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
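The distinction described above can be sketched in a few lines (pure-Python illustration; DebugQuantizer's real signature may differ):

```python
# Sketch: decide whether tensor parallelism is active from tp_size,
# not from "tp_group is None" -- None means the world group in
# torch.distributed, so it cannot distinguish "no TP" from "TP over world".

def tp_is_active(tp_size: int) -> bool:
    return tp_size > 1

# tp_group=None with tp_size=8 is still tensor-parallel:
assert tp_is_active(8)
assert not tp_is_active(1)
```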

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Add CPU offloading documentation

Documents the get_cpu_offload_context() API with examples for basic usage,
manual synchronization, and CUDA graphs integration. Adds new
other_optimizations section to the documentation structure.

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* Fix copyright year to 2026 in cpu_offloading.rst

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* Add missing copyright headers to CPU offloading example files

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* Improve cpu_offloading docs: legacy params note, manual sync clarifications

- Add comment about legacy parameters in function signature
- Clarify that num_layers is ignored (not forbidden) when manual_synchronization=True
- Document num_layers constraint: must be <= model_layers-2 for overlap
- Add note that offload_stream.synchronize() must precede release_activation_forward_gpu_memory()

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* Remove incorrect note about offload_stream.synchronize()

release_activation_forward_gpu_memory() internally waits for offload
completion via CUDA events - explicit synchronize() is not required.

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* Update docs/features/other_optimizations/cpu_offloading/cpu_offloading.rst

Co-authored-by: Przemyslaw Tredak <ptrendx@gmail.com>
Signed-off-by: Paweł Gadziński <62263673+pggPL@users.noreply.github.com>

* update

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fixes

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fixes

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fixes

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* docs: review fixes for cpu offloading documentation

- Export mark_not_offload and ManualOffloadSynchronizer from
  transformer_engine.pytorch and add them to API reference
- Fix condition 3 description (xi is needed as input to backward,
  not computed after it completes)
- Fix offload_weights docstring default value (False, not True)
- Fix legacy params comment to link to API reference
- Fix RST spacing around inline code in Fig. 3/4 captions
- Add note explaining retain_pinned_cpu_buffers history and
  pytorch#167507 fix landing in PyTorch 2.11
- Fix typo: seqeuences -> sequences
- Add trailing newline to other_optimizations/index.rst

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

---------

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Signed-off-by: Paweł Gadziński <62263673+pggPL@users.noreply.github.com>
Co-authored-by: Przemyslaw Tredak <ptrendx@gmail.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
…ling (#2741)

add guard at bisected jax version where lower is segfault

Signed-off-by: tdophung <tdophung@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix pylint: remove unused lru_cache import and fix import order in helper.py

Signed-off-by: tdophung <tdophung@nvidia.com>

* Guard Triton tests against JAX < 0.8.0 using release version check

- Add version_utils.py with is_triton_extension_supported() checking JAX >= 0.8.0
  (release version, not dev snapshot) and TRITON_EXTENSION_MIN_JAX_VERSION constant
- Add pytest.mark.triton marker and conftest hook to skip marked tests on old JAX
- Add require_triton() for module-level skipping in test files
- Rewrite triton_extensions to use is_triton_extension_supported() instead of
  direct jaxlib dev-version comparison

Signed-off-by: tdophung <tdophung@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Address review: allow_module_level, drop is_triton_extension_supported re-export, revert test.sh

- require_triton(): add allow_module_level=True to pytest.skip() so module-level
  calls on old JAX produce a proper skip instead of a collection failure
- Remove is_triton_extension_supported from triton_extensions/utils.py __all__:
  importing triton_extensions on JAX < 0.8.0 raises immediately, so re-exporting
  the check from there defeats its purpose; callers should import directly from
  transformer_engine.jax.version_utils
- Revert qa/L0_jax_lint/test.sh TE_PATH to /opt/transformerengine (local dev
  path was accidentally committed; pass TE_PATH= at invocation time instead)

Signed-off-by: tdophung <tdophung@nvidia.com>

* Address review: move version guard before gpu_triton import, fix __all__ and hardcoded version

- Move is_triton_extension_supported() guard before the gpu_triton import block
  with a comment clarifying the segfault is at dispatch time, not import time
- Remove _jax_version_meet_requirement from version_utils __all__ (private helper,
  not a public API; callers import it explicitly as needed)
- Use TRITON_EXTENSION_MIN_JAX_VERSION constant in conftest marker description
  instead of hardcoded '0.8.0'

Signed-off-by: tdophung <tdophung@nvidia.com>

* address more comments

Signed-off-by: tdophung <tdophung@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: tdophung <tdophung@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
…(#2751)

* Support configurable number of philox rounds for SR during build

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* format and lint

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

---------

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Added better error messages

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Update transformer_engine/pytorch/distributed.py

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Przemyslaw Tredak <ptrendx@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fixes

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

---------

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
Signed-off-by: Przemyslaw Tredak <ptrendx@gmail.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: Kaining Zhong <kainingz@nvidia.com>
* First pass

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Cleaning the dtype usage in dequantize and distributed

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add fake_dtype to get_metadata

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Fix to make_like

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

---------

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
…e_function_fwd to fp32 (#2752)

* Fixed aval for intermediate results (softmaxed/sigmoided logits) passed as residuals to CompType, which is currently fp32. This prevents incorrect reads of this buffer when the logits dtype is not fp32

Signed-off-by: tdophung <tdophung@nvidia.com>

* address comments about inconsistency in style and NVTE_CHECK for fp32 type

Signed-off-by: tdophung <tdophung@nvidia.com>

* revert the remaining CompType checking, address greptile suggestion

Signed-off-by: tdophung <tdophung@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: tdophung <tdophung@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Initial commit to pass scale as Tensor for multi_tensor_scale op

Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>

* Enable capturable mode for optimizer if store_param_remainders is passed but not actually enabled

Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>

* Revert "Enable capturable mode for optimizer if store_param_remainders is passed but not actually enabled"

This reverts commit 74a9bccf0fadd4159f70d28da49a533ea7c76108.

Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>

* Apply suggestion from @greptile-apps[bot]

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>

* Change noop_flag to is_infinite

Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>

* Update transformer_engine/pytorch/csrc/extensions.h

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Remove duplication

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Add test for scale tensor cuda

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

---------

Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* MXFP8 grouped GEMM + tensor-scaled FP8 fixes

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Change version to 13.3

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>

* Random padding condition shouldn't be done for mxfp8

Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>

* Remove incorrect comment

Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>

* CUBLAS > 13.2 is enough

Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>

* CUBLAS version needed for MXFP8 indeed seems to be 13.3

Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>

* Accidentally removed line added back, plus changes needed to trigger CI

Add documentation for scaling factors in common.h

Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>

* Update cuBLAS version requirement for MXFP8 support

Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>

* grouped gemm: address code review comments

- Replace nvte_set/get_grouped_tensor_swizzled_scales with nvte_set_grouped_tensor_param
- Add host-side validation: A and B must use same scaling mode (both MXFP8 or both tensor scaling)
- Add host-side validation: A and B must both be FP8 or both non-FP8; restrict inputs to FP8/BF16
- Restrict output (C/D) to BF16/FP32; remove FP16 from supported types
- Refactor workspace allocation: replace manual offset arithmetic with moving pointer pattern
- Use void* + NVTEScalingMode in setup kernel instead of separate float*/char* scale params
- Extract use_columnwise(swap_dims) helper to eliminate duplicated MXFP8 columnwise blocks
- Split set_fp8_scale_pointers into set_fp8_scale_pointers / set_mxfp8_scale_pointers
- Remove scale_inv_ptrs from GroupedOperandSelection; pass workspace pointers directly
- Move swizzled-scales validation into validate_grouped_gemm_inputs for fail-fast behavior
- Add use_split_accumulator to GroupedMatmulConfig (Hopper only, default false)
- Add FP8 test case with per-tensor scales; add BF16/MXFP8 shape-varying test cases

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
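The host-side validation rules listed above can be sketched in Python (illustrative only; the real checks are implemented in C++ in `validate_grouped_gemm_inputs`, and the dtype names here are placeholders):

```python
# Sketch of the host-side operand validation for grouped GEMM.

FP8_DTYPES = {"fp8e4m3", "fp8e5m2"}

def validate_grouped_gemm_inputs(a_dtype, b_dtype, a_scaling, b_scaling, out_dtype):
    a_fp8, b_fp8 = a_dtype in FP8_DTYPES, b_dtype in FP8_DTYPES
    # A and B must both be FP8 or both non-FP8; inputs restricted to FP8/BF16.
    if a_fp8 != b_fp8:
        raise ValueError("A and B must both be FP8 or both non-FP8")
    if not a_fp8 and (a_dtype != "bf16" or b_dtype != "bf16"):
        raise ValueError("non-FP8 inputs must be BF16")
    # A and B must use the same scaling mode (both MXFP8 or both tensor scaling).
    if a_fp8 and a_scaling != b_scaling:
        raise ValueError("A and B must use the same scaling mode")
    # Output (C/D) restricted to BF16/FP32; FP16 is no longer supported.
    if out_dtype not in {"bf16", "fp32"}:
        raise ValueError("output must be BF16 or FP32")

# Valid: both MXFP8 inputs, BF16 output.
validate_grouped_gemm_inputs("fp8e4m3", "fp8e4m3", "mxfp8", "mxfp8", "bf16")
```

Performing these checks on the host before launch gives the fail-fast behavior mentioned in the review notes.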

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: vthumbe1503 <vthumbe@nvidia.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Co-authored-by: Pawel Gadzinski <pgadzinski@nvidia.com>
…tensors. #2120" (#2673)

* Adds dst.dtype information in copy_ method of quantized tensors.

Signed-off-by: Zhiyi Su <dantesuu@gmail.com>

* Update transformer_engine/pytorch/tensor/quantized_tensor.py

Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Signed-off-by: ZhiyiDanielSu <35579247+zobeideThePlayer@users.noreply.github.com>

* Update transformer_engine/pytorch/quantized_tensor.py

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Fix reference tensor copy

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

---------

Signed-off-by: Zhiyi Su <dantesuu@gmail.com>
Signed-off-by: ZhiyiDanielSu <35579247+zobeideThePlayer@users.noreply.github.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Co-authored-by: Zhiyi Su <dantesuu@gmail.com>
Co-authored-by: ZhiyiDanielSu <35579247+zobeideThePlayer@users.noreply.github.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Fuse scale + 0 + cumulative sum for splits to offsets calc

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
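The fused splits-to-offsets computation above amounts to scaling each split, prepending a zero, and taking a cumulative sum. A pure-Python reference (the real version is a single CUDA kernel):

```python
# Sketch of the splits -> offsets calculation that the fused kernel performs.

def splits_to_offsets(splits, scale=1):
    """Exclusive prefix sum of scaled splits, with a leading 0."""
    offsets = [0]
    for s in splits:
        offsets.append(offsets[-1] + s * scale)
    return offsets

assert splits_to_offsets([3, 5, 2]) == [0, 3, 8, 10]
assert splits_to_offsets([3, 5, 2], scale=4) == [0, 12, 32, 40]
```

Fusing the scale, the zero-prepend, and the cumulative sum into one kernel avoids three separate launches and intermediate tensors.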

* Add unit test and fix bug in kernel for >256 size

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* fix race

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* check for logical_last_dim > 0

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* suggestions

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

---------

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
* Added new people to CI

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Removing duplicate

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

---------

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
… parallelism (#2688)

* Error out if constructing LayerNormLinear with row tensor parallelism

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Disable Userbuffers test for row-TP LayerNormLinear

Signed-off-by: Tim Moon <tmoon@nvidia.com>

---------

Signed-off-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Enable cgemm + FP8 tests

* Implement CGEMM + MXFP8

* added size check for mxfp8

* added tols for assertions

* update tests with recipes

* enable tests + is_quantize_recipe_supported

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

---------

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
…kScaling and Float8BlockScaling quantized model init. (#2753)

* Updates FusedAdam with FSDP2 and MXFP8

Signed-off-by: Jonathan Mitchell <jomitchell@ipp1-1334.ipp1a1.colossus.nvidia.com>

* removes xfailing unit test for MXFP8

Signed-off-by: Jonathan Mitchell <jomitchell@ipp1-1334.ipp1a1.colossus.nvidia.com>

* addresses comments related to reset parameters and guard against self.capturable

Signed-off-by: Jonathan Mitchell <jomitchell@ipp1-1334.ipp1a1.colossus.nvidia.com>

* adds e2e unit test

Signed-off-by: Jonathan Mitchell <jomitchell@ipp1-1334.ipp1a1.colossus.nvidia.com>

* adds test to non meta device init

Signed-off-by: Jonathan Mitchell <jomitchell@ipp1-1334.ipp1a1.colossus.nvidia.com>

* attempts to add float8block scaling fsdp hooks

Signed-off-by: Jonathan Mitchell <jomitchell@ipp1-1334.ipp1a1.colossus.nvidia.com>

* adds e2e test for Float8BlockScaling

Signed-off-by: Jonathan Mitchell <jomitchell@ipp1-1334.ipp1a1.colossus.nvidia.com>

* addresses review comments and code cleanup

Signed-off-by: Jonathan Mitchell <jomitchell@ipp1-1334.ipp1a1.colossus.nvidia.com>

* more review comments addressed

Signed-off-by: Jonathan Mitchell <jomitchell@ipp1-1334.ipp1a1.colossus.nvidia.com>

* removes unused block_len param

Signed-off-by: Jonathan Mitchell <jomitchell@umb-b300-dp-147.ipp4a1.colossus.nvidia.com>

* fixes failing unit test because we still need to xfail nvfp4 dcp

Signed-off-by: Jonathan Mitchell <jomitchell@ipp1-1334.ipp1a1.colossus.nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* lint - replacing todo with note

Signed-off-by: Jonathan Mitchell <jomitchell@ipp1-1429.ipp1a1.colossus.nvidia.com>

---------

Signed-off-by: Jonathan Mitchell <jomitchell@ipp1-1334.ipp1a1.colossus.nvidia.com>
Signed-off-by: Jonathan Mitchell <jomitchell@umb-b300-dp-147.ipp4a1.colossus.nvidia.com>
Signed-off-by: Jonathan Mitchell <jomitchell@ipp1-1429.ipp1a1.colossus.nvidia.com>
Co-authored-by: Jonathan Mitchell <jomitchell@ipp1-1334.ipp1a1.colossus.nvidia.com>
Co-authored-by: Jonathan Mitchell <jomitchell@umb-b300-dp-147.ipp4a1.colossus.nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: vthumbe1503 <vthumbe@nvidia.com>
Co-authored-by: Jonathan Mitchell <jomitchell@ipp1-1429.ipp1a1.colossus.nvidia.com>
Signed-off-by: Peter St. John <pstjohn@nvidia.com>
* fix for async dcp checkpointing

Signed-off-by: Peter St. John <pstjohn@nvidia.com>

* Apply suggestions from code review

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Peter St. John <peterc.stjohn@gmail.com>

* Update transformer_engine/pytorch/tensor/storage/float8_tensor_storage.py

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Address Greptile review feedback: defensive guards for edge cases

- Add _quantizer None guard in new_empty dispatch
- Replace self.is_cpu with explicit _data/_transpose checks in __reduce_ex__
- Make get_metadata() safe for cleared tensors (both _data and _transpose None)

Signed-off-by: Peter St. John <pstjohn@nvidia.com>

---------

Signed-off-by: Peter St. John <pstjohn@nvidia.com>
Signed-off-by: Peter St. John <peterc.stjohn@gmail.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
…ped Tensor Swizzling (#2669)

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove changes not needed for bf16

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* keep only pytorch binding for now

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* linting error

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* add fast accumulator support

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* MXFP8 grouped GEMM + tensor-scaled FP8 fixes

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Change version to 13.3

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>

* fix the test

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Random padding condition shouldn't be done for mxfp8

Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>

* Remove incorrect comment

Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>

* CUBLAS > 13.2 is enough

Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>

* CUBLAS version needed for MXFP8 indeed seems to be 13.3

Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>

* all changes for grouped gemm

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Accidentally removed line added back, plus changes needed to trigger CI

Add documentation for scaling factors in common.h

Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>

* Update cuBLAS version requirement for MXFP8 support

Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>

* Update transformer_engine/common/gemm/cublaslt_grouped_gemm.cu

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>

* grouped gemm: address code review comments

- Replace nvte_set/get_grouped_tensor_swizzled_scales with nvte_set_grouped_tensor_param
- Add host-side validation: A and B must use same scaling mode (both MXFP8 or both tensor scaling)
- Add host-side validation: A and B must both be FP8 or both non-FP8; restrict inputs to FP8/BF16
- Restrict output (C/D) to BF16/FP32; remove FP16 from supported types
- Refactor workspace allocation: replace manual offset arithmetic with moving pointer pattern
- Use void* + NVTEScalingMode in setup kernel instead of separate float*/char* scale params
- Extract use_columnwise(swap_dims) helper to eliminate duplicated MXFP8 columnwise blocks
- Split set_fp8_scale_pointers into set_fp8_scale_pointers / set_mxfp8_scale_pointers
- Remove scale_inv_ptrs from GroupedOperandSelection; pass workspace pointers directly
- Move swizzled-scales validation into validate_grouped_gemm_inputs for fail-fast behavior
- Add use_split_accumulator to GroupedMatmulConfig (Hopper only, default false)
- Add FP8 test case with per-tensor scales; add BF16/MXFP8 shape-varying test cases

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* address review comments

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* missed merge conflict handling

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* minor change

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* forgot to add an `or`

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* resolve merge conflicts

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* address review comments

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* address minor review comments

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* remove unnecessary code

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* one line that broke everything :(

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* address review comments

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* address review comments

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update transformer_engine/common/gemm/cublaslt_grouped_gemm.cu

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>

* Update transformer_engine/common/gemm/cublaslt_grouped_gemm.cu

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>

* address review comments

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* unnecessary

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* revert caching changes

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* Update transformer_engine/common/gemm/cublaslt_grouped_gemm.cu

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>

* fix minor bug from greptile

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* revert for now

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* address review comments

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* address review comments

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* address review comments

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* address review comments

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* address review comments

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* address review comments

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

---------

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Jeremy Berchtold <jberchtold@nvidia.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Co-authored-by: Pawel Gadzinski <pgadzinski@nvidia.com>
@ipanfilo ipanfilo requested review from Micky774 and aris134 April 21, 2026 21:17
@ipanfilo ipanfilo added the ci-level 3 CI test level 3 label May 11, 2026